Abstract

Most previous work in knowledge base (KB) completion has focused on the
problem of relation extraction. In this work, we focus on the task of
inferring missing entity type instances in a KB, a fundamental task for KB
completion that has nevertheless received little attention.

Due to the novelty of this task, we construct a large-scale dataset and
design an automatic evaluation methodology. Our knowledge base completion
method uses information within the existing KB and external information from
Wikipedia. We show that individual methods trained with a global objective
that considers unobserved cells from both the entity and the type side give
consistently higher quality predictions compared to baseline methods. We
also perform manual evaluation on a small subset of the data to verify the
effectiveness of our knowledge base completion methods and the correctness
of our proposed automatic evaluation method.


1 Introduction 


There is now increasing interest in the construction of knowledge bases like
Freebase (Bollacker et al., 2008) and NELL (Carlson et al., 2010) in the
natural language processing community. KBs contain facts such as Tiger Woods
is an athlete, and Barack Obama is the president of USA. However, one of the
main drawbacks of existing KBs is that they are incomplete and are missing
important facts (West et al., 2014), jeopardizing their usefulness in
downstream tasks such as question answering. This has made the task of
completing the knowledge base entries, or Knowledge Base Completion (KBC),
extremely important.

* Most of the research conducted during summer internship at Microsoft.

In this paper, we address an important subproblem of knowledge base
completion: inferring missing entity type instances. Most previous work in
KB completion has focused only on the problem of relation extraction
(Mintz et al., 2009; Nickel et al., 2011; Bordes et al., 2013; Riedel et
al., 2013).

Entity type information is crucial in KBs and is widely used in many NLP
tasks such as relation extraction (Chang et al., 2014), coreference
resolution (Ratinov and Roth, 2012; Hajishirzi et al., 2013), entity linking
(Fang and Chang, 2014), semantic parsing (Kwiatkowski et al., 2013; Berant
et al., 2013) and question answering (Bordes et al., 2014; Yao and Durme,
2014). For example, adding entity type information improves relation
extraction by 3% (Chang et al., 2014) and entity linking by 4.2 F1 points
(Guo et al., 2013). Despite its importance, there is surprisingly little
previous work on this problem, and there are no datasets publicly available
for evaluation.

We construct a large-scale dataset for the task of inferring missing entity
type instances in a KB. Most previous KBC datasets (Mintz et al., 2009;
Riedel et al., 2013) are constructed using a single snapshot of the KB, and
methods are evaluated on a subset of facts that are hidden during training.
Hence, the methods could potentially be evaluated on their ability to
predict easy facts that the KB already contains. Moreover, the methods are
not directly evaluated on their ability to predict missing facts. To
overcome these drawbacks, we construct the train and test data using two
snapshots of the KB and evaluate the methods on predicting facts that are
added to the more recent snapshot, enabling a more realistic and challenging
evaluation.

[Figure 1 image: Freebase entry for Jean Metellus, listing the type
/people/deceased_person, with the description "Jean Metellus was a Haitian
neurologist, poet, novelist and playwright".]

Figure 1: Freebase description of Jean Metellus can be used to infer that the
entity has the type /book/author. This missing fact is found by our algorithm
and was still missing in the latest version of Freebase when the paper was
written.

Standard evaluation metrics for KBC methods are generally type-based (Mintz
et al., 2009; Riedel et al., 2013), measuring the quality of the predictions
by aggregating scores computed within a type. This is not ideal because:
(1) it treats every entity type equally, without considering the
distribution of types, and (2) it does not measure the ability of the
methods to rank predictions across types. Therefore, we additionally use a
global evaluation metric, where the quality of predictions is measured
within and across types, and which also accounts for the high variance in
type distribution. In our experiments, we show that models trained with
negative examples from the entity side perform better on type-based metrics,
while models trained with negative examples from the type side perform
better on the global metric.

In order to design methods that can rank predictions both within and across
entity (or relation) types, we propose a global objective to train the
models. Our proposed method combines the advantages of previous approaches
by using negative examples from both the entity and the type side. When
considering the same number of negative examples, we find that the linear
classifiers and the low-dimensional embedding models trained with the global
objective produce better quality ranking within and across entity types when
compared to training with negative examples only from the entity or the type
side. Additionally, compared to prior methods, the model trained on the
proposed global objective can more reliably suggest confident entity-type
pair candidates that could be added into the given knowledge base.

Our contributions are summarized as follows: 


• We develop an evaluation framework comprising methods for dataset
construction and evaluation metrics to evaluate KBC approaches for
missing entity type instances. The dataset and evaluation scripts are
publicly available at
http://research.microsoft.com/en-US/downloads/df481862-65cc-4b05-886c-acc181ad07bb/default.aspx

• We propose a global training objective for KBC methods. The experimental
results show that both linear classifiers and low-dimensional embedding
models achieve best overall performance when trained with the global
objective function.

• We conduct extensive studies on models for inferring missing type
instances, studying the impact of using various features and models.

2 Inferring Entity Types 

We consider a KB $\Lambda$ containing entity type information of the form
$(e, t)$, where $e \in E$ ($E$ is the set of all entities) is an entity in
the KB with type $t \in T$ ($T$ is the set of all types). For example, $e$
could be Tiger Woods and $t$ could be sports athlete. As a single entity can
have multiple types, entities in Freebase often miss some of their types.
The aim of this work is to infer missing entity type instances in the KB.
Given an unobserved fact (an entity-type pair) in the training data,
$(e, t) \notin \Lambda$ where entity $e \in E$ and type $t \in T$, the task
is to infer whether the KB currently misses the fact, i.e., infer whether
$(e, t) \in \Lambda$. We consider entities in the intersection of Freebase
and Wikipedia in our experiments.

2.1 Information Resources 

Now, we describe the information sources used to construct the feature
representation of an entity to infer its types. We use information in
Freebase and external information from Wikipedia to complete the KB.

• Entity Type Features: The entity types observed in the training data can
be a useful signal to infer missing entity type instances. For example, in
our snapshot of Freebase, it is not uncommon to find an entity with the
type /people/deceased_person but missing the type /people/person.

• Freebase Description: Almost all entities in Freebase have a short
one-paragraph description of the entity. Figure 1 shows the Freebase
description of Jean Metellus that can be used to infer the type
/book/author, which Freebase does not contain as of the date of writing
this article.

• Wikipedia: As external information, we include the Wikipedia full text
article of an entity in its feature representation. We consider entities
in Freebase that have a link to their Wikipedia article. The Wikipedia
full text of an entity gives several clues to predict its entity types.
For example, Figure 2 shows a section of the Wikipedia article of Claire
Martin which gives clues to infer the type /award/award_winner that
Freebase misses.


3 Evaluation Framework 


In this section, we propose an evaluation methodology for the task of
inferring missing entity type instances in a KB. While we focus on
recovering entity types, the proposed framework can be easily adapted to
relation extraction as well.

First, we discuss our two-snapshot dataset construction strategy. Then we
motivate the importance of evaluating KBC algorithms globally and describe
the evaluation metrics we employ.


3.1 Two Snapshots Construction 

In most previous work on KB completion to predict missing relation facts
(Mintz et al., 2009; Riedel et al., 2013), the methods are evaluated on a
subset of facts from a single KB snapshot that are hidden while training.
However, given that the missing entries are usually selected randomly, the
distribution of the selected unknown entries could be very different from
the actual distribution of missing facts. Also, since any fact could
potentially be used for evaluation, the methods could be evaluated on their
ability to predict easy facts that are already present in the KB.

To overcome this drawback, we construct our train and test set by
considering two snapshots of the knowledge base. The train snapshot is taken
from an earlier time without special treatment. The test snapshot is taken
from a later period, and a KBC algorithm is evaluated by its ability to
recover newly added knowledge in the test snapshot. This enables the methods
to be directly evaluated on facts that are missing in a KB snapshot. Note
that the facts that are added to the test snapshot are, in general, more
subtle than the facts that the KB already contains, so predicting the newly
added facts could be harder. Hence, our approach enables a more realistic
and challenging evaluation setting than previous work.

We use manually constructed Freebase as the KB in our experiments. Notably,
Chang et al. (2014) use a two-snapshot strategy for constructing a dataset
for relation extraction using automatically constructed NELL as their KB.
The new facts that are added to a KB by an automatic method may not have all
the characteristics that make the two-snapshot strategy more advantageous.

We construct our train snapshot $\Lambda_0$ by taking the Freebase snapshot
on 3rd September, 2013 and consider entities that have a link to their
Wikipedia page. KBC algorithms are evaluated by their ability to predict
facts that were added to the 1st June, 2014 snapshot of Freebase, $\Lambda$.
To get negative data, we make a closed world assumption, treating any
unobserved instance in Freebase as a negative example. Unobserved instances
in the Freebase snapshots on 3rd September, 2013 and 1st June, 2014 are used
as negative examples in training and testing respectively.¹

The positive instances in the test data ($\Lambda - \Lambda_0$) are facts
that are newly added to the test snapshot $\Lambda$. Using the entire set of
negative examples in the test data is impractical due to the large number of
negative examples. To avoid this, we only add the negative types of entities
that have at least one new fact in the test data. Additionally, we add a
portion of the negative examples for entities which do not have a new fact
in the test data and that were unused during training. This makes our
dataset quite challenging since the number of negative instances is much
larger than the number of positive instances in the test data.

¹ Note that some of the negative instances used in training could be
positive instances in test, but we do not remove them during training.

[Figure 2 image: "Life and career" section of the Wikipedia article on
Claire Martin.]

Martin was born in Wimbledon, London. She grew up in a "house full of
music", and claims to have learned all of Judy Garland's songs by the time
she was 12. She cites Ella Fitzgerald's Song Books as being the life
changing influence which inspired her to attend stage school and later to
study singing in both New York and London. Her professional career started
with her first engagement, aboard the QE2, where she sang in the Theater Bar
for two years.

At the age of 21, Martin formed her own jazz quartet. In 1991, she was
signed by the Scottish jazz label Linn Records, and her debut album, The
Waiting Game, was released in 1992. The album was well reviewed and was
selected by The Times as one of their "Albums of the Year". Later that year,
she opened for Tony Bennett at the Glasgow International Jazz Festival.

Martin continued performing and recording, garnering numerous awards and
rave reviews throughout the 1990s and early 2000s, including wins at the BBC
Jazz Awards for Best Vocalist, and six wins at the British Jazz Awards. She
has released a total of thirteen albums, all on the Linn label, and has
collaborated with various prominent musicians including Martin Taylor, John
Martyn, Stephane Grappelli, Mark Nightingale, Sir Richard Rodney Bennett,
Jim Mullen and Nigel Hitchcock. In addition to her singing career, she is
also a co-presenter for Jazz Line Up on BBC Radio 3.

Figure 2: A section of the Wikipedia article of Claire Martin which gives
clues that the entity has the type /award/award_winner. This currently
missing fact is also found by our algorithm.
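The two-snapshot construction can be illustrated with a minimal sketch. It assumes each snapshot is held in memory as a set of (entity, type) pairs; the function name and the toy identifiers (`e1`, `t1`, ...) are hypothetical, and the sketch covers only the negatives of entities that gained a new fact, not the additional sampled portion described above.

```python
def build_test_examples(train_kb, test_kb, types):
    """Derive test positives and negatives from two KB snapshots.

    train_kb, test_kb: sets of (entity, type) pairs (earlier / later snapshot).
    types: the set of all types under consideration.
    """
    # Positives: facts added between the two snapshots.
    positives = test_kb - train_kb
    # Keep only entities that gained at least one new fact.
    entities = {e for e, _ in positives}
    # Closed-world assumption: a pair absent from the later snapshot is negative.
    negatives = {(e, t) for e in entities for t in types
                 if (e, t) not in test_kb}
    return positives, negatives


# Toy data (hypothetical): e1 gains type t2 in the later snapshot.
train_kb = {("e1", "t1"), ("e2", "t1")}
test_kb = train_kb | {("e1", "t2")}
positives, negatives = build_test_examples(train_kb, test_kb, {"t1", "t2", "t3"})
print(positives)  # {('e1', 't2')}
print(negatives)  # {('e1', 't3')}
```

Entity `e2` contributes nothing here because it has no new fact in the later snapshot.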

It is important to note that the goal of this work is not to predict facts
that emerged between the time period of the train and test snapshots.² For
example, we do not aim to predict the type /award/award_winner for an entity
that won an award after 3rd September, 2013. Hence, we use the Freebase
description in the training data snapshot and the Wikipedia snapshot on 3rd
September, 2013 to get the features for entities.

One might worry that the new snapshot might contain a significant amount of
emerging facts, so that it could not be an effective way to evaluate the KBC
algorithms. Therefore, we examined the difference between the training
snapshot and the test snapshot manually and found that this is likely not
the case. For example, we randomly selected 25 /award/award_winner instances
that were added to the test snapshot and found that all of them had won at
least one award before 3rd September, 2013.

Note that while this automatic evaluation is closer 
to the real-world scenario, it is still not perfect as the 
new KB snapshot is still incomplete. Therefore, we 
also perform human evaluation on a small dataset to 
verify the effectiveness of our approach. 


² In this work, we also do not aim to correct existing false positive errors
in Freebase.


3.2 Global Evaluation Metric 


Mean average precision (MAP) (Manning et al., 2008) is now commonly used to
evaluate KB completion methods (Mintz et al., 2009; Riedel et al., 2013).
MAP is defined as the mean of average precision over all entity (or
relation) types. MAP treats each entity type equally (not explicitly
accounting for their distribution). However, some types occur much more
frequently than others. For example, in our large-scale experiment with 500
entity types, there are many entity types with only 5 instances in the test
set while the most frequent entity type has tens of thousands of missing
instances. Moreover, MAP only measures the ability of the methods to
correctly rank predictions within a type.

To account for the high variance in the distribution of entity types and
measure the ability of the methods to correctly rank predictions across
types, we use global average precision (GAP) (similarly to micro-F1) as an
additional evaluation metric for KB completion. We convert the multi-label
classification problem to a binary classification problem where the label of
an entity and type pair is true if the entity has that type in Freebase and
false otherwise. GAP is the average precision of this transformed problem,
which can measure the ability of the methods to rank predictions both within
and across entity types.
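The difference between MAP and GAP can be made concrete with a small sketch. It assumes predictions are given as (entity, type, score, is_positive) tuples and uses the usual ranked-list definition of average precision; the function names and the toy predictions are illustrative only.

```python
def average_precision(scored):
    """scored: list of (score, is_positive) pairs; higher score ranks first."""
    ranked = sorted(scored, key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / hits if hits else 0.0


def mean_average_precision(preds):
    """MAP: average precision computed per type, then averaged over types."""
    by_type = {}
    for _, t, score, label in preds:
        by_type.setdefault(t, []).append((score, label))
    return sum(average_precision(v) for v in by_type.values()) / len(by_type)


def global_average_precision(preds):
    """GAP: one average precision over all (entity, type) pairs pooled."""
    return average_precision([(score, label) for _, _, score, label in preds])


# Toy predictions: each type is ranked perfectly on its own, but a negative
# of type A outscores the positive of type B.
preds = [("e1", "A", 0.9, True), ("e2", "A", 0.8, False),
         ("e3", "B", 0.3, True), ("e4", "B", 0.2, False)]
print(mean_average_precision(preds))    # 1.0
print(global_average_precision(preds))  # 0.8333...
```

In this toy case MAP is perfect while GAP is penalized, because only GAP is sensitive to how scores compare across types.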


Prior to us, Bordes et al. (2013) use mean reciprocal rank as a global
evaluation metric for a KBC task. We use average precision instead of mean
reciprocal rank since MRR could be biased to the top predictions of the
method (West et al., 2014).

While GAP captures global ordering, it would be beneficial to measure the
quality of the top k predictions of the model for bootstrapping and active
learning scenarios (Lewis and Gale, 1994; Cucerzan and Yarowsky, 1999). We
report G@k, GAP measured on the top k predictions (similarly to Precision@k
and Hits@k). This metric can be reliably used to measure the overall quality
of the top k predictions.
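One plausible reading of G@k is sketched below. The text does not spell out the normalization, so this sketch assumes that G@k is the average precision computed over the k highest-scored pairs only, normalized by the number of positives among them; that choice, and the function name, are assumptions rather than the definition used in the experiments.

```python
def gap_at_k(scored, k):
    """Average precision restricted to the k highest-scored predictions.

    scored: list of (score, is_positive) pairs pooled over all types.
    Assumption: normalize by the number of positives within the top k.
    """
    top = sorted(scored, key=lambda pair: -pair[0])[:k]
    hits, total = 0, 0.0
    for rank, (_, is_positive) in enumerate(top, start=1):
        if is_positive:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


scored = [(0.9, True), (0.8, False), (0.7, True), (0.1, True)]
print(gap_at_k(scored, 1))  # 1.0
print(gap_at_k(scored, 3))  # 0.8333...
```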


4 Global Objective for Knowledge Base Completion

We describe our approach for predicting missing entity types in a KB in this
section. While we focus on recovering entity types in this paper, the
methods we develop can be easily extended to other KB completion tasks.


4.1 Global Objective Framework 


During training, only positive examples are observed in KB completion tasks.
Similar to previous work (Mintz et al., 2009; Bordes et al., 2013; Riedel et
al., 2013), we get negative training examples by treating the unobserved
data in the KB as negative examples. Because the number of unobserved
examples is much larger than the number of facts in the KB, we follow
previous methods and sample a few unobserved negative examples for every
positive example.

Previous methods largely neglect the sampling methods on unobserved negative
examples. The proposed global objective framework allows us to
systematically study the effect of the different sampling methods used to
get negative data, as the performance of the model on different evaluation
metrics does depend on the sampling method.

We consider a training snapshot of the KB $\Lambda_0$, containing facts of
the form $(e, t)$ where $e$ is an entity in the KB with type $t$. Given a
fact $(e, t)$ in the KB, we consider two types of negative examples
constructed from the following two sets: $\mathcal{N}_E(e, t)$ is the
"negative entity set", and $\mathcal{N}_T(e, t)$ is the "negative type set".
More precisely,

$$\mathcal{N}_E(e, t) \subset \{e' \mid e' \in E,\ e' \neq e,\ (e', t) \notin \Lambda_0\},$$

and

$$\mathcal{N}_T(e, t) \subset \{t' \mid t' \in T,\ t' \neq t,\ (e, t') \notin \Lambda_0\}.$$
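As an illustration of the two negative sets, the sketch below draws m negative entities and n negative types for one fact. Uniform sampling without replacement is an assumption made here for simplicity (the text only states that a few unobserved examples are sampled), and the function name and toy identifiers are hypothetical.

```python
import random


def sample_negatives(e, t, kb, entities, types, m, n, rng):
    """Sample m negative entities and n negative types for the fact (e, t).

    kb: set of observed (entity, type) pairs (the training snapshot).
    entities, types: lists of all entities / types.
    rng: a random.Random instance (uniform sampling is an assumption).
    """
    # Candidates for the negative entity set: e' != e with (e', t) unobserved.
    entity_pool = [x for x in entities if x != e and (x, t) not in kb]
    # Candidates for the negative type set: t' != t with (e, t') unobserved.
    type_pool = [x for x in types if x != t and (e, x) not in kb]
    neg_entities = rng.sample(entity_pool, min(m, len(entity_pool)))
    neg_types = rng.sample(type_pool, min(n, len(type_pool)))
    return neg_entities, neg_types


kb = {("e1", "t1"), ("e2", "t1"), ("e1", "t2")}
print(sample_negatives("e1", "t1", kb, ["e1", "e2", "e3"],
                       ["t1", "t2", "t3"], 1, 1, random.Random(0)))
# (['e3'], ['t3'])
```

In the toy KB, `e2` already has type `t1` and `e1` already has type `t2`, so each pool contains a single candidate and the result is deterministic.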

Let $\theta$ be the model parameters, and let $m = |\mathcal{N}_E(e, t)|$
and $n = |\mathcal{N}_T(e, t)|$ be the number of negative entities and
negative types considered for training respectively. For each entity-type
pair $(e, t)$, we define the scoring function of our model as
$s(e, t \mid \theta)$.³ We define two loss functions, one using negative
entities and the other using negative types:

$$L_E(\Lambda_0, \theta) = \sum_{(e,t) \in \Lambda_0,\ e' \in \mathcal{N}_E(e,t)} [s(e', t) - s(e, t) + 1]_+^k,$$

and

$$L_T(\Lambda_0, \theta) = \sum_{(e,t) \in \Lambda_0,\ t' \in \mathcal{N}_T(e,t)} [s(e, t') - s(e, t) + 1]_+^k,$$

where $k$ is the power of the loss function ($k$ can be 1 or 2), and the
function $[\cdot]_+$ is the hinge function.

The global objective function is defined as

$$\min_{\theta}\ \mathrm{Reg}(\theta) + C\, L_T(\Lambda_0, \theta) + C\, L_E(\Lambda_0, \theta), \tag{1}$$

where $\mathrm{Reg}(\theta)$ is the regularization term of the model, and
$C$ is the regularization parameter. Intuitively, the parameters $\theta$
are estimated to rank the observed facts above the negative examples with a
margin. The total number of negative examples is controlled by the size of
the sets $\mathcal{N}_E$ and $\mathcal{N}_T$. In Section 5, we experiment
with sampling only entities, only types, or both, while fixing the total
number of negative examples.
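To make the objective in Eq. (1) concrete, the following sketch evaluates it for a fixed scoring function. It assumes the negative sets have already been sampled and are stored in dictionaries keyed by fact; the function names, the `reg` argument (a precomputed value of the regularization term) and the toy scores are all illustrative.

```python
def hinge(x, k=1):
    """[x]_+^k: the hinge function raised to the power k."""
    return max(0.0, x) ** k


def global_objective(facts, neg_entities, neg_types, score, C=1.0, k=1, reg=0.0):
    """Value of Reg + C * L_T + C * L_E for a given scoring function.

    facts: iterable of observed (e, t) pairs.
    neg_entities[(e, t)], neg_types[(e, t)]: sampled negative sets.
    score(e, t): the model's scoring function s(e, t).
    """
    loss_e = sum(hinge(score(e2, t) - score(e, t) + 1, k)
                 for (e, t) in facts for e2 in neg_entities[(e, t)])
    loss_t = sum(hinge(score(e, t2) - score(e, t) + 1, k)
                 for (e, t) in facts for t2 in neg_types[(e, t)])
    return reg + C * loss_t + C * loss_e


# Toy scores: the negative entity e2 violates the margin, the negative type
# t2 does not.
table = {("e1", "t1"): 2.0, ("e2", "t1"): 1.5, ("e1", "t2"): 0.0}
facts = [("e1", "t1")]
neg_e = {("e1", "t1"): ["e2"]}
neg_t = {("e1", "t1"): ["t2"]}
print(global_objective(facts, neg_e, neg_t, lambda e, t: table[(e, t)]))       # 0.5
print(global_objective(facts, neg_e, neg_t, lambda e, t: table[(e, t)], k=2))  # 0.25
```

Setting `neg_types` to empty lists recovers a loss that uses only negative entities, and vice versa, which is how the single-sided objectives fit into the same framework.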

The rest of the section is organized as follows: we propose three algorithms
based on the global objective in Section 4.2. In Section 4.3, we discuss the
relationship between the proposed algorithms and existing approaches. Let
$\Phi(e) \to \mathbb{R}^{d_e}$ be the feature function that maps an entity
to its feature representation, and $\Psi(t) \to \mathbb{R}^{d_t}$ be the
feature function that maps an entity type to its feature representation.⁴
$d_e$ and $d_t$ represent the feature dimensionality of the entity features
and the type features respectively. Feature representations of the entity
types ($\Psi$) are only used in the embedding model.

³ We often use $s(e, t)$ as an abbreviation of $s(e, t \mid \theta)$ in
order to save space.

⁴ This gives the possibility of defining features for the labels in the
output space, but we use a simple one-hot representation for types right now
since richer features did not give performance gains in our initial
experiments.










Algorithm 1 The training algorithm for Linear.Adagrad.

Initialize $w_t = 0$, $\forall t = 1 \ldots |T|$
for $(e, t) \in \Lambda_0$ do
    for $e' \in \mathcal{N}_E(e, t)$ do
        if $w_t^T \Phi(e) - w_t^T \Phi(e') - 1 < 0$ then
            AdaGradUpdate($w_t$, $\Phi(e') - \Phi(e)$)
        end if
    end for
    for $t' \in \mathcal{N}_T(e, t)$ do
        if $w_t^T \Phi(e) - w_{t'}^T \Phi(e) - 1 < 0$ then
            AdaGradUpdate($w_t$, $-\Phi(e)$)
            AdaGradUpdate($w_{t'}$, $\Phi(e)$)
        end if
    end for
end for


4.2 Algorithms 

We propose three different algorithms based on the global objective
framework for predicting missing entity types. Two algorithms use the linear
model and the other one uses the embedding model.


Linear Model. The scoring function in this model is given by
$s(e, t \mid \theta = \{w_t\}) = w_t^T \Phi(e)$, where
$w_t \in \mathbb{R}^{d_e}$ is the parameter vector for target type $t$. The
regularization term in Eq. (1) is defined as follows:
$\mathrm{Reg}(\theta) = \frac{1}{2} \sum_{t=1}^{|T|} w_t^T w_t$. We use
$k = 2$ in our experiments. Our first algorithm is obtained by using the
dual coordinate descent algorithm (Hsieh et al., 2008) to optimize Eq. (1),
where we modified the original algorithm to handle multiple weight vectors.
We refer to this algorithm as Linear.DCD.

While the DCD algorithm ensures convergence to the global optimum, its
convergence can be slow in certain cases. Therefore, we adopt an online
algorithm, Adagrad (Duchi et al., 2011). We use the hinge loss function
($k = 1$) with no regularization ($\mathrm{Reg}(\theta) = 0$) since it gave
the best results in our initial experiments. We refer to this algorithm as
Linear.Adagrad, which is described in Algorithm 1. Note that
AdaGradUpdate($x$, $g$) is a procedure which updates the vector $x$ with
respect to the gradient $g$.
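A minimal, dense-vector sketch of Algorithm 1 follows. The AdaGradUpdate step is written as the standard per-coordinate AdaGrad rule; the learning rate, the epsilon constant, the number of passes over the data and all names in the code are assumptions for illustration, not settings reported in this paper.

```python
import math


class AdaGradVector:
    """A weight vector with per-coordinate AdaGrad state."""

    def __init__(self, dim, lr=0.1, eps=1e-8):
        self.w = [0.0] * dim       # weights, initialized to zero
        self.g2 = [0.0] * dim      # running sum of squared gradients
        self.lr, self.eps = lr, eps

    def update(self, grad):
        """AdaGradUpdate(w, grad): scaled gradient step on non-zero coordinates."""
        for i, g in enumerate(grad):
            if g != 0.0:
                self.g2[i] += g * g
                self.w[i] -= self.lr * g / (math.sqrt(self.g2[i]) + self.eps)

    def dot(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))


def train_linear_adagrad(facts, neg_entities, neg_types, phi, types, dim, epochs=5):
    """phi maps an entity to its feature list; returns one weight vector per type."""
    w = {t: AdaGradVector(dim) for t in types}
    for _ in range(epochs):  # multiple passes are an assumption of this sketch
        for (e, t) in facts:
            for e2 in neg_entities[(e, t)]:
                # Margin violated by a negative entity: move w_t towards
                # phi(e) and away from phi(e2).
                if w[t].dot(phi[e]) - w[t].dot(phi[e2]) - 1 < 0:
                    w[t].update([b - a for a, b in zip(phi[e], phi[e2])])
            for t2 in neg_types[(e, t)]:
                # Margin violated by a negative type: raise s(e, t), lower s(e, t2).
                if w[t].dot(phi[e]) - w[t2].dot(phi[e]) - 1 < 0:
                    w[t].update([-a for a in phi[e]])
                    w[t2].update(list(phi[e]))
    return w


# Toy problem (hypothetical): two entities with one-hot features.
phi = {"e1": [1.0, 0.0], "e2": [0.0, 1.0]}
facts = [("e1", "t1"), ("e2", "t2")]
neg_e = {("e1", "t1"): ["e2"], ("e2", "t2"): ["e1"]}
neg_t = {("e1", "t1"): ["t2"], ("e2", "t2"): ["t1"]}
w = train_linear_adagrad(facts, neg_e, neg_t, phi, ["t1", "t2"], dim=2)
```

After training on this toy problem, `w["t1"].dot(phi["e1"])` exceeds both `w["t1"].dot(phi["e2"])` and `w["t2"].dot(phi["e1"])`, i.e. the observed fact outranks its negative entity and its negative type.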


Embedding Model. In this model, vector representations are constructed for
entities and types using linear projection matrices. Recall that
$\Psi(t) \to \mathbb{R}^{d_t}$ is the feature function that maps a type to
its feature representation. The scoring function is given by

$$s(e, t \mid \theta = (U, V)) = \Psi(t)^T V U^T \Phi(e),$$

where $U \in \mathbb{R}^{d_e \times d}$ and $V \in \mathbb{R}^{d_t \times d}$
are projection matrices that embed the entities and types in a
$d$-dimensional space. Similarly to the linear classifier model, we use the
L1-hinge loss function ($k = 1$) with no regularization
($\mathrm{Reg}(\theta) = 0$). $U_i$ and $V_i$ denote the $i$-th column
vector of the matrices $U$ and $V$, respectively. The algorithm is described
in detail in Algorithm 2.

Algorithm 2 The training algorithm for the embedding model.

Initialize $V$, $U$ randomly.
for $(e, t) \in \Lambda_0$ do
    for $e' \in \mathcal{N}_E(e, t)$ do
        if $s(e, t) - s(e', t) - 1 < 0$ then
            $\mu \leftarrow V^T \Psi(t)$
            $\eta \leftarrow U^T (\Phi(e') - \Phi(e))$
            for $i \in 1 \ldots d$ do
                AdaGradUpdate($U_i$, $\mu[i] (\Phi(e') - \Phi(e))$)
                AdaGradUpdate($V_i$, $\eta[i] \Psi(t)$)
            end for
        end if
    end for
    for $t' \in \mathcal{N}_T(e, t)$ do
        if $s(e, t) - s(e, t') - 1 < 0$ then
            $\mu \leftarrow V^T (\Psi(t') - \Psi(t))$
            $\eta \leftarrow U^T \Phi(e)$
            for $i \in 1 \ldots d$ do
                AdaGradUpdate($U_i$, $\mu[i] \Phi(e)$)
                AdaGradUpdate($V_i$, $\eta[i] (\Psi(t') - \Psi(t))$)
            end for
        end if
    end for
end for

The embedding model has more expressive power than the linear model, but
its training, unlike that of the linear model, converges only to a local
optimum since the objective function is non-convex.
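The bilinear scoring function of the embedding model can be sketched in a few lines. Matrices are represented as lists of rows, and the helper name `mat_t_vec` and the toy values are hypothetical; the sketch only evaluates the score and does not perform training.

```python
def mat_t_vec(M, x):
    """Compute M^T x, where M is a list of rows and x has one entry per row."""
    n_cols = len(M[0])
    return [sum(M[r][c] * x[r] for r in range(len(M))) for c in range(n_cols)]


def embedding_score(U, V, phi_e, psi_t):
    """s(e, t) = psi_t^T V U^T phi_e, computed as (V^T psi_t) . (U^T phi_e)."""
    entity_vec = mat_t_vec(U, phi_e)  # d-dimensional entity embedding
    type_vec = mat_t_vec(V, psi_t)    # d-dimensional type embedding
    return sum(a * b for a, b in zip(type_vec, entity_vec))


# Toy example with d_e = 3 entity features, d_t = 2 types and d = 2.
U = [[1, 0], [0, 1], [1, 1]]   # d_e x d
V = [[1, 2], [3, 4]]           # d_t x d
phi_e = [1, 0, 2]
psi_t = [0, 1]                 # one-hot vector selecting the second type
print(embedding_score(U, V, phi_e, psi_t))  # 17
```

With a one-hot type representation, `V^T psi_t` simply selects the row of `V` for that type, so each type has its own d-dimensional embedding while `U` is shared across all types.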


4.3 Relationship to Existing Methods 

Many existing methods for relation extraction and entity type prediction can
be cast as a special case under the global objective framework. For example,
we can consider the work in relation extraction (Mintz et al., 2009; Bordes
et al., 2013; Riedel et al., 2013) as models trained with
$\mathcal{N}_T(e, t) = \emptyset$. These models are trained only using
negative entities, which we refer to as the Negative Entity (NE) objective.
The entity type prediction model in Ling and Weld (2012) is a linear model
with $\mathcal{N}_E(e, t) = \emptyset$, which we refer to as the Negative
Type (NT) objective. The embedding model described in Weston et al. (2011),
developed for image retrieval, is also a special case of our model trained
with the NT objective.

                                    70 types    500 types
Entities                            2.2M        2.2M

Training Data Statistics ($\Lambda_0$)
positive examples                   4.5M        6.2M
max #ent for a type                 1.1M        1.1M
min #ent for a type                 6732        32

Test Data Statistics ($\Lambda - \Lambda_0$)
positive examples                   163K        240K
negative examples                   17.1M       132M
negative/positive ratio             105.22      554.44

Table 1: Statistics of our dataset. $\Lambda_0$ is our training snapshot and
$\Lambda$ is our test snapshot. An example is an entity-type pair.

While the NE or NT objective functions could be suitable for some
classification tasks (Weston et al., 2011), the choice of objective
functions for the KBC tasks has not been well motivated. Often the choice is
made neither with theoretical foundation nor with empirical support. To the
best of our knowledge, the global objective function, which includes both
$\mathcal{N}_E(e, t)$ and $\mathcal{N}_T(e, t)$, has not been considered
previously by KBC methods.


5 Experiments 

In this section, we give details about our dataset and discuss our
experimental results. Finally, we perform manual evaluation on a small
subset of the data.


5.1 Data

First, we evaluate our methods on 70 entity types with the most observed
facts in the training data.⁵ We also perform large-scale evaluation by
testing the methods on 500 types with the most observed facts in the
training data.

Table 1 shows statistics of our dataset. The number of positive examples is
much larger in the training data than in the test data since the test set
contains only facts that were added to the more recent snapshot. An
additional effect of this is that most of the facts in the test data are
about entities that are not very well-known or famous. The high ratio of
negative to positive examples in the test data makes this dataset very
challenging.

⁵ We removed a few entity types that were trivial to predict in the test
data.


5.2 Automatic Evaluation Results 

Table 2 shows automatic evaluation results on 70 types and 500 types. We
compare different aspects of the system on 70 types empirically.

Adagrad vs. DCD. We first study the linear models by comparing Linear.DCD
and Linear.Adagrad. Table 2a shows that Linear.Adagrad consistently performs
better for our task.

Impact of Features. We compare the effect of different features on the final
performance using Linear.Adagrad in Table 2b. Types are represented by
boolean features while the Freebase description and the Wikipedia full text
are represented using tf-idf weighting. The best MAP results are obtained by
using all the information (T+D+W) while the best GAP results are obtained by
using the Freebase description and the Wikipedia article of the entity
(D+W). Note that the features are simply concatenated when multiple
resources are used. We tried to use idf weighting on type features and on
all features, but they did not yield improvements.
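The feature construction described above (boolean type indicators concatenated with tf-idf vectors) can be sketched as follows. The exact tf-idf variant is not specified in the text, so this sketch assumes raw term counts multiplied by log(N / df); tokenization is assumed to have been done already, and all names and toy documents are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocab, dense tf-idf vector per doc).

    Assumed weighting: raw term count times log(N / document frequency).
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    df = Counter(tok for doc in docs for tok in set(doc))
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[tok] * math.log(n_docs / df[tok]) for tok in vocab])
    return vocab, vectors


def entity_features(observed_types, all_types, description_vec, wikipedia_vec):
    """Concatenate boolean type features with the two text representations."""
    type_vec = [1.0 if t in observed_types else 0.0 for t in all_types]
    return type_vec + description_vec + wikipedia_vec


# Toy documents (hypothetical): one short description per entity.
docs = [["poet", "novelist"], ["poet", "singer"]]
vocab, vecs = tfidf_vectors(docs)
features = entity_features({"/people/person"},
                           ["/book/author", "/people/person"],
                           vecs[0], vecs[0])
print(vocab)          # ['novelist', 'poet', 'singer']
print(len(features))  # 8
```

In the toy corpus the token "poet" appears in every document, so its idf is log(1) = 0 and it contributes nothing to either vector.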


The Importance of Global Objective. Tables 2c and 2d compare the global
training objective with the NE and NT training objectives. Note that all
three methods use the same number of negative examples. More precisely, for
each $(e, t) \in \Lambda_0$,
$|\mathcal{N}_E(e, t)| + |\mathcal{N}_T(e, t)| = m + n = 2$. The results
show that the global training objective achieves the best scores on both MAP
and GAP for classifiers and low-dimensional embedding models. Among NE and
NT, NE performs better on the type-based metric while NT performs better on
the global metric.

Linear Model vs. Embedding Model. Finally, we compare the linear classifier
model with the embedding model in Table 2e. The linear classifier model
performs better than the embedding model in both MAP and GAP.

We perform large-scale evaluation on 500 types with the description features
(as experiments are expensive) and the results are shown in Table 2f.
































Features                   Algorithm        MAP     GAP
Description                Linear.Adagrad   29.17   28.17
Description                Linear.DCD       28.40   27.76
Description + Wikipedia    Linear.Adagrad   33.28   31.97
Description + Wikipedia    Linear.DCD       31.92   31.36

(a) Adagrad vs. Dual coordinate descent (DCD). Results are obtained using
linear models trained with global training objective (m=1, n=1) on 70 types.


Features   Objective               MAP     GAP
D+W        NE (m = 2)              33.01   23.97
D+W        NT (n = 2)              31.61   29.09
D+W        Global (m = 1, n = 1)   33.28   31.97
T+D+W      NE (m = 2)              34.56   21.79
T+D+W      NT (n = 2)              34.45   31.42
T+D+W      Global (m = 1, n = 1)   36.13   31.13

(c) Global Objective vs NE and NT. Results are obtained using Linear.Adagrad
on 70 types.


Features          MAP     GAP
Type (T)          12.33   13.58
Description (D)   29.17   28.17
Wikipedia (W)     30.81   30.56
D+W               33.28   31.97
T+D+W             36.13   31.13

(b) Feature Comparison. Results are obtained from using Linear.Adagrad with
global training objective (m=1, n=1) on 70 types.


Features   Objective               MAP     GAP
D+W        NE (m = 2)              30.92   22.38
D+W        NT (n = 2)              25.77   23.40
D+W        Global (m = 1, n = 1)   31.60   30.13
T+D+W      NE (m = 2)              28.70   19.34
T+D+W      NT (n = 2)              28.06   25.42
T+D+W      Global (m = 1, n = 1)   30.35   28.71

(d) Global Objective vs NE and NT. Results are obtained using the embedding
model on 70 types.


Features   Model            MAP     GAP     G@1000   G@10000
D+W        Linear.Adagrad   33.28   31.97   79.63    68.08
D+W        Embedding        31.60   30.13   73.40    64.69
T+D+W      Linear.Adagrad   36.13   31.13   70.02    65.09
T+D+W      Embedding        30.35   28.71   62.61    64.30

(e) Model Comparison. The models were trained with the global training
objective (m=1, n=1) on 70 types.


Model            MAP     GAP     G@1000   G@10000
Linear.Adagrad   13.28   20.49   69.23    60.14
Embedding        9.82    17.67   55.31    51.29

(f) Results on 500 types using Freebase description features. We train the
models with the global training objective (m=1, n=1).

Table 2: Automatic Evaluation Results. Note that
$m = |\mathcal{N}_E(e, t)|$ and $n = |\mathcal{N}_T(e, t)|$.


One might expect that with the increased number of types, the embedding
model would perform better than the classifier since it shares parameters
across types. However, despite the recent popularity of embedding models in
NLP, the linear model still performs better in our task.

5.3 Human Evaluation 

To verify the effectiveness of our KBC algorithms and the correctness of our
automatic evaluation method, we perform manual evaluation on the top 100
predictions of the output obtained from two different experimental settings.
The results are shown in Table 3. Even though the automatic evaluation gives
pessimistic results since the test KB is also incomplete,⁶ the results
indicate that the automatic evaluation is correlated with manual evaluation.
More excitingly, among the 179 unique instances we manually evaluated, 17 of
them are still⁷ missing in Freebase, which emphasizes the effectiveness of
our approach.


⁶ This is true even with existing automatic evaluation methods.

⁷ At submission time.




















































Features   G@100   G@100-M   Accuracy-M
D+W        87.68   97.31     97
T+D+W      84.91   91.47     88

Table 3: Manual vs. Automatic evaluation of top 100 predictions on 70 types.
Predictions are obtained by training a linear classifier using Adagrad with
global training objective (m=1, n=1). G@100-M and Accuracy-M are computed by
manual evaluation.


5.4 Error Analysis 

• Effect of training data: We find that the performance of the models on a
type is highly dependent on the number of training instances for that
type. For example, the linear classifier model when evaluated on 70 types
performs 24.86% better on the most frequent 35 types compared to the least
frequent 35 types. This indicates that bootstrapping or active learning
techniques can be profitably used to provide more supervision for the
methods. In this case, G@k would be a useful metric to compare the
effectiveness of the different methods.

• Shallow linguistic features: We found that some of the false positive
predictions are caused by the use of shallow linguistic features. For
example, an entity who has acted in a movie and composes music only for
television shows is wrongly tagged with the type /film/composer since
words like "movie", "composer" and "music" occur frequently in the
Wikipedia article of the entity
(http://en.wikipedia.org/wiki/J._J._Abrams).


6 Related Work 


Entity Type Prediction and Wikipedia Features. Much previous work (Pantel et
al., 2012; Ling and Weld, 2012) in entity type prediction has focused on the
task of predicting entity types at the sentence level. Yao et al. (2013)
develop a method based on matrix factorization for entity type prediction in
a KB using information within the KB and New York Times articles. However,
the method was still evaluated only at the sentence level. Toral and Munoz
(2006) and Kazama and Torisawa (2007) use the first line of an entity's
Wikipedia article to perform named entity recognition on three entity types.


Knowledge Base Completion. Much previous work in KB completion has focused
on the problem of relation extraction. The majority of the methods infer
missing relation facts using information within the KB (Nickel et al., 2011;
Lao et al., 2011; Socher et al., 2013; Bordes et al., 2013), while methods
such as Mintz et al. (2009) use information in text documents. Riedel et al.
(2013) use information both within and outside the KB to complete the KB.


Linear Embedding Model. Weston et al. (2011) is one of the first works that
developed a supervised linear embedding model and applied it to image
retrieval. We apply this model to entity type prediction, but we train it
using a different objective function which is better suited to our task.


7 Conclusion and Future Work 

We propose an evaluation framework comprising methods for dataset
construction and evaluation metrics to evaluate KBC approaches for inferring
missing entity type instances. We verified that our automatic evaluation is
correlated with human evaluation, and our dataset and evaluation scripts are
publicly available. Experimental results show that models trained with our
proposed global training objective produce higher quality ranking within and
across types when compared to baseline methods.

In future work, we plan to use information from entity-linked documents to
improve performance and also to explore active learning and other
human-in-the-loop methods to get more training data.